{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qnpO3iadYEY2"
   },
   "source": [
    "# Ungraded Lab: Building Models for the IMDB Reviews Dataset\n",
    "\n",
    "In this lab, you will build four models and train it on the [IMDB Reviews dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews) with full word encoding. These use different layers after the embedding namely `Flatten`, `LSTM`, `GRU`, and `Conv1D`. You will compare the performance and see which architecture might be best for this particular dataset. Let's begin!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-6PhPXVCa_1i"
   },
   "source": [
    "## Imports\n",
    "\n",
    "You will first import common libraries that will be used throughout the exercise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WA0Fi9p9ah5_"
   },
   "outputs": [],
   "source": [
    "import tensorflow_datasets as tfds\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BTmnR_9dbBY9"
   },
   "source": [
    "## Download and Prepare the Dataset\n",
    "\n",
    "Next, you will download the `plain_text` version of the `IMDB Reviews` dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "P-AhVYeBWgQ3"
   },
   "outputs": [],
   "source": [
    "# The dataset is already downloaded for you. For downloading you can use the code below.\n",
    "imdb = tfds.load(\"imdb_reviews\", as_supervised=True, data_dir=\"../data/\", download=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JLbfudtjYXGu"
   },
   "outputs": [],
   "source": [
    "# Get the train and test sets\n",
    "train_dataset, test_dataset = imdb['train'], imdb['test']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BwDDyjFClgGc"
   },
   "source": [
    "Then, you will build the vocabulary based on the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vJkuD3_yYGRH"
   },
   "outputs": [],
   "source": [
    "# Vectorization and Padding Parameters\n",
    "\n",
    "VOCAB_SIZE = 10000\n",
    "MAX_LENGTH = 120\n",
    "PADDING_TYPE = 'pre'\n",
    "TRUNC_TYPE = 'post'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0C6_mgFcfQL9"
   },
   "outputs": [],
   "source": [
    "# Instantiate the vectorization layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)\n",
    "\n",
    "# Get the string inputs and integer outputs of the training set\n",
    "train_reviews = train_dataset.map(lambda review, label: review)\n",
    "\n",
    "# Generate the vocabulary based only on the training set\n",
    "vectorize_layer.adapt(train_reviews)\n",
    "\n",
    "# Delete because it's no longer needed\n",
    "del train_reviews"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3oSlhRQOl5UJ"
   },
   "source": [
    "In Week 2, you generated the padded sequences by chaining `map()` and `apply()` methods. Here's a similar way to do that. You will just call an `apply()` then do the transformations in one preprocessing function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jTSofLPjYjG-"
   },
   "outputs": [],
   "source": [
    "def preprocessing_fn(dataset):\n",
    "  '''Generates padded sequences from a tf.data.Dataset'''\n",
    "\n",
    "  # Apply the vectorization layer to the string features\n",
    "  dataset_sequences = dataset.map(\n",
    "      lambda text, label: (vectorize_layer(text), label)\n",
    "      )\n",
    "\n",
    "  # Put all elements in a single ragged batch\n",
    "  dataset_sequences = dataset_sequences.ragged_batch(\n",
    "      batch_size=dataset_sequences.cardinality()\n",
    "      )\n",
    "\n",
    "  # Output a tensor from the single batch. Extract the sequences and labels.\n",
    "  sequences, labels = dataset_sequences.get_single_element()\n",
    "\n",
    "  # Pad the sequences\n",
    "  padded_sequences = tf.keras.utils.pad_sequences(\n",
    "      sequences.numpy(),\n",
    "      maxlen=MAX_LENGTH,\n",
    "      truncating=TRUNC_TYPE,\n",
    "      padding=PADDING_TYPE\n",
    "      )\n",
    "\n",
    "  # Convert back to a tf.data.Dataset\n",
    "  padded_sequences = tf.data.Dataset.from_tensor_slices(padded_sequences)\n",
    "  labels = tf.data.Dataset.from_tensor_slices(labels)\n",
    "\n",
    "  # Combine the padded sequences and labels\n",
    "  dataset_vectorized = tf.data.Dataset.zip(padded_sequences, labels)\n",
    "\n",
    "  return dataset_vectorized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OhBml1CfYnlj"
   },
   "outputs": [],
   "source": [
    "# Preprocess the train and test data\n",
    "train_dataset_vectorized = train_dataset.apply(preprocessing_fn)\n",
    "test_dataset_vectorized = test_dataset.apply(preprocessing_fn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2wQUKm2qm7CA"
   },
   "source": [
    "View a couple of examples. You should see tuples of tensors with a padded sequence and label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OPKcC3j1cYtE"
   },
   "outputs": [],
   "source": [
    "# View 2 training sequences and its labels\n",
    "for example in train_dataset_vectorized.take(2):\n",
    "  print(example)\n",
    "  print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Lr9vhbN4nY5c"
   },
   "source": [
    "You will do the optimization and batching as usual."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iPj0nd67gC-L"
   },
   "outputs": [],
   "source": [
    "SHUFFLE_BUFFER_SIZE = 1000\n",
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 32\n",
    "\n",
    "# Optimize and batch the datasets for training\n",
    "train_dataset_final = (train_dataset_vectorized\n",
    "                       .cache()\n",
    "                       .shuffle(SHUFFLE_BUFFER_SIZE)\n",
    "                       .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                       .batch(BATCH_SIZE)\n",
    "                       )\n",
    "\n",
    "test_dataset_final = (test_dataset_vectorized\n",
    "                      .cache()\n",
    "                      .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                      .batch(BATCH_SIZE)\n",
    "                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cs4GDKAFbJdq"
   },
   "source": [
    "## Plot Utility\n",
    "\n",
    "The function below will visualize the accuracy and loss history after training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nHGYuU4jPYaj"
   },
   "outputs": [],
   "source": [
    "def plot_loss_acc(history):\n",
    "  '''Plots the training and validation loss and accuracy from a history object'''\n",
    "  acc = history.history['accuracy']\n",
    "  val_acc = history.history['val_accuracy']\n",
    "  loss = history.history['loss']\n",
    "  val_loss = history.history['val_loss']\n",
    "\n",
    "  epochs = range(len(acc))\n",
    "\n",
    "  fig, ax = plt.subplots(1,2, figsize=(12, 6))\n",
    "  ax[0].plot(epochs, acc, 'bo', label='Training accuracy')\n",
    "  ax[0].plot(epochs, val_acc, 'b', label='Validation accuracy')\n",
    "  ax[0].set_title('Training and validation accuracy')\n",
    "  ax[0].set_xlabel('epochs')\n",
    "  ax[0].set_ylabel('accuracy')\n",
    "  ax[0].legend()\n",
    "\n",
    "  ax[1].plot(epochs, loss, 'bo', label='Training Loss')\n",
    "  ax[1].plot(epochs, val_loss, 'b', label='Validation Loss')\n",
    "  ax[1].set_title('Training and validation loss')\n",
    "  ax[1].set_xlabel('epochs')\n",
    "  ax[1].set_ylabel('loss')\n",
    "  ax[1].legend()\n",
    "\n",
    "  plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bUoZJv02bP0m"
   },
   "source": [
    "## Model 1: Flatten\n",
    "\n",
    "First up is simply using a `Flatten` layer after the embedding. Its main advantage is that it is very fast to train. Observe the results below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_SRAyulSaWAa"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "EMBEDDING_DIM = 16\n",
    "DENSE_DIM = 6\n",
    "\n",
    "# Model Definition with a Flatten layer\n",
    "model_flatten = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),\n",
    "    tf.keras.layers.Flatten(),\n",
    "    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Set the training parameters\n",
    "model_flatten.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Print the model summary\n",
    "model_flatten.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tYLZUZ3Ga1ok"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 10\n",
    "\n",
    "# Train the model\n",
    "history_flatten = model_flatten.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=(test_dataset_final))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fVPLbqcca6U2"
   },
   "outputs": [],
   "source": [
    "# Plot the accuracy and loss history\n",
    "plot_loss_acc(history_flatten)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2w_soBeUbSXu"
   },
   "source": [
    "## LSTM\n",
    "\n",
    "Next, you will use an LSTM. This is slower to train but useful in applications where the order of the tokens is important."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wSualgGPPK0S"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "EMBEDDING_DIM = 16\n",
    "LSTM_DIM = 32\n",
    "DENSE_DIM = 6\n",
    "\n",
    "# Model Definition with LSTM\n",
    "model_lstm = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),\n",
    "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_DIM)),\n",
    "    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Set the training parameters\n",
    "model_lstm.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Print the model summary\n",
    "model_lstm.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "crEvEcQmUQiL"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 10\n",
    "\n",
    "# Train the model\n",
    "history_lstm = model_lstm.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=test_dataset_final)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QVwnSYF-aIha"
   },
   "outputs": [],
   "source": [
    "# Plot the accuracy and loss history\n",
    "plot_loss_acc(history_lstm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tcBMGJgzcXkl"
   },
   "source": [
    "## GRU\n",
    "\n",
    "The *Gated Recurrent Unit* or [GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) is usually referred to as a simpler version of the LSTM. It can be used in applications where the sequence is important but you want faster results and can sacrifice some accuracy. You will notice in the model summary that it is a bit smaller than the LSTM and it also trains faster by a few seconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5NEpdhb8AxID"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "EMBEDDING_DIM = 16\n",
    "GRU_DIM = 32\n",
    "DENSE_DIM = 6\n",
    "\n",
    "# Model Definition with GRU\n",
    "model_gru = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),\n",
    "    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(GRU_DIM)),\n",
    "    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Set the training parameters\n",
    "model_gru.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Print the model summary\n",
    "model_gru.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "V5LLrXC-uNX6"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 10\n",
    "\n",
    "# Train the model\n",
    "history_gru = model_gru.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=(test_dataset_final))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7kwU-2skSQ3E"
   },
   "outputs": [],
   "source": [
    "# Plot the accuracy and loss history\n",
    "plot_loss_acc(history_gru)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ugToQrB-cfr5"
   },
   "source": [
    "## Convolution\n",
    "\n",
    "Lastly, you will use a convolution layer to extract features from your dataset. You will append a [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer to reduce the results before passing it on to the dense layers. Like the model with `Flatten`, this also trains much faster than the ones using RNN layers like `LSTM` and `GRU`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "K_Jc7cY3Qxke"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "EMBEDDING_DIM = 16\n",
    "FILTERS = 128\n",
    "KERNEL_SIZE = 5\n",
    "DENSE_DIM = 6\n",
    "\n",
    "# Model Definition with Conv1D\n",
    "model_conv = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),\n",
    "    tf.keras.layers.Conv1D(FILTERS, KERNEL_SIZE, activation='relu'),\n",
    "    tf.keras.layers.GlobalAveragePooling1D(),\n",
    "    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Set the training parameters\n",
    "model_conv.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Print the model summary\n",
    "model_conv.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "aUV70isnTiFF"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 10\n",
    "\n",
    "# Train the model\n",
    "history_conv = model_conv.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=(test_dataset_final))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "T42EmhV0XhRV"
   },
   "outputs": [],
   "source": [
    "# Plot the accuracy and loss history\n",
    "plot_loss_acc(history_conv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UgTIZxoUkv0l"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "Now that you've seen the results for each model, can you make a recommendation on what works best for this dataset? Do you still get the same results if you tweak some hyperparameters like the vocabulary size? Try tweaking some of the values some more so you can get more insight on what model performs best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the cell below to free up resources for the next lab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shutdown the kernel to free up resources. \n",
    "# Note: You can expect a pop-up when you run this cell. You can safely ignore that and just press `Ok`.\n",
    "\n",
    "from IPython import get_ipython\n",
    "\n",
    "k = get_ipython().kernel\n",
    "\n",
    "k.do_shutdown(restart=False)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "C3_W3_Lab_4_imdb_reviews_with_GRU_LSTM_Conv1D.ipynb",
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
